navigation agent
EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning
Lin, Bingqian, Nie, Yunshuang, Zai, Khun Loun, Wei, Ziming, Han, Mingfei, Xu, Rongtao, Niu, Minzhe, Han, Jianhua, Zhang, Hanwang, Lin, Liang, Chen, Bokui, Lu, Cewu, Liang, Xiaodan
Abstract--Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash their reasoning ability for enhancing vision-language navigation (VLN) performance, while simultaneously mitigating the domain gap between LLMs' training corpora and the VLN task. However, these approaches predominantly adopt straightforward input-output mapping paradigms, which make the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable and may lead to overfitting under pure CoT supervised fine-tuning. To address these issues, we propose EvolveNav, a novel self-improving embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning for boosting LLM-based vision-language navigation. Specifically, EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate the model's navigational reasoning
These two authors contributed equally to this work. Bokui Chen, Cewu Lu, and Xiaodan Liang are the corresponding authors. Bingqian Lin and Cewu Lu are with Shanghai Jiao Tong University, Shanghai, China. Yunshuang Nie, Khun Loun Zai, and Ziming Wei are with the Shenzhen Campus of Sun Yat-sen University, Shenzhen, China. Xiaodan Liang is with the Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; Peng Cheng Laboratory; and the Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, 510006, China. Bokui Chen is with the Tsinghua Shenzhen International Graduate School, Tsinghua University, China. Mingfei Han is with the Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE.
- Asia > China > Guangdong Province > Shenzhen (1.00)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.54)
- Asia > China > Shanghai > Shanghai (0.45)
- (6 more...)
- Research Report (1.00)
- Personal (1.00)
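The "formalized CoT labels" mentioned in the EvolveNav abstract can be pictured as (prompt, target) pairs whose rationale follows a fixed template rather than free-form text. The sketch below is a hypothetical illustration of such a sample builder; the field names, template wording, and helper function are assumptions, not the paper's actual format.

```python
# Hypothetical sketch of a formalized-CoT training sample for
# LLM-based VLN supervised fine-tuning. All names are illustrative.

def build_cot_sample(instruction, candidate_views, landmark, chosen_idx):
    """Pack one navigation step into a (prompt, target) pair whose
    target follows a fixed reasoning template instead of a bare action."""
    obs = "\n".join(
        f"Candidate {i}: {desc}" for i, desc in enumerate(candidate_views)
    )
    prompt = (
        f"Instruction: {instruction}\n"
        f"Observations:\n{obs}\n"
        "Think, then choose a candidate."
    )
    # The rationale is constrained to a template (landmark to look for ->
    # matching candidate -> action), keeping labels consistent and
    # checkable during fine-tuning.
    target = (
        f"Reasoning: the instruction mentions '{landmark}'; "
        f"Candidate {chosen_idx} matches it.\n"
        f"Action: Candidate {chosen_idx}"
    )
    return {"prompt": prompt, "target": target}

sample = build_cot_sample(
    "Walk past the sofa and stop at the kitchen door.",
    ["a sofa next to a window", "a kitchen doorway", "a dark hallway"],
    landmark="kitchen door",
    chosen_idx=1,
)
```

Constraining the rationale to a template is one way such labels could be generated automatically at scale, which matters because, as the abstract notes, perfect free-form CoT labels are unavailable for navigation.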
NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks
Luo, Zhihao, Yan, Wentao, Gong, Jingyu, Wang, Min, Zhang, Zhizhong, Wang, Xuhong, Xie, Yuan, Tan, Xin
Recent advances in Graphical User Interface (GUI) agents and embodied navigation have driven rapid progress, yet the two domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDPs), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first agent to unify GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks using a single formulation; (ii) employs a unified reinforcement learning framework on the mixed data to improve generalization; and (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further demonstrate the efficacy of our unified training strategy, data-mixing strategy, and reward design.
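The shared MDP view that the NaviMaster abstract appeals to can be made concrete with a minimal sketch: both GUI and embodied navigation reduce to (state, action, reward, next-state) transitions, so one policy interface can serve both. The class, field names, and reward shape below are illustrative assumptions, not NaviMaster's actual API or reward.

```python
# Minimal sketch of a unified MDP transition for GUI and embodied
# navigation. All names and the reward shape are assumptions.
from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    state: Any       # screenshot (GUI) or egocentric view (embodied)
    action: Any      # click/scroll (GUI) or move/turn (embodied)
    reward: float    # e.g. a distance-aware reward toward the visual target
    next_state: Any
    done: bool

def distance_aware_reward(prev_dist: float, new_dist: float, done: bool) -> float:
    """One plausible shape for a 'distance-aware reward': reward progress
    toward the visual target, with a bonus on reaching it."""
    return (prev_dist - new_dist) + (10.0 if done else 0.0)
```

Because the `Transition` container is agnostic to what `state` and `action` actually hold, trajectories from both domains can be mixed into one replay buffer, which is the kind of unification the abstract's "mixed data" training relies on.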
NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions
Yang, Haolin, Long, Yuxing, Yu, Zhuoyuan, Yang, Zihan, Wang, Minghan, Xu, Jiapeng, Wang, Yihan, Yu, Ziyan, Cai, Wenzhe, Kang, Lei, Dong, Hao
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Vision (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
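A benchmark like NavSpace, with its 1,228 trajectory-instruction pairs grouped into six task categories, is ultimately scored by running an agent on each pair and aggregating per-category success. The loop below is a hedged sketch of such an evaluation harness; the pair format, the success test, and the agent interface are assumptions for illustration, not NavSpace's actual protocol.

```python
# Hypothetical per-category success-rate evaluation over
# trajectory-instruction pairs. Data format is an assumption.
from collections import defaultdict

def evaluate(agent, pairs):
    """pairs: iterable of dicts with 'category', 'instruction', 'goal'.
    Returns success rate per task category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in pairs:
        totals[p["category"]] += 1
        final_pos = agent(p["instruction"])  # agent returns its end position
        if final_pos == p["goal"]:           # success: stopped at the goal
            hits[p["category"]] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Toy usage: an "agent" that always stops at position 0.
sr = evaluate(lambda _: 0, [
    {"category": "vertical", "instruction": "go up", "goal": 0},
    {"category": "vertical", "instruction": "go down", "goal": 1},
])
# sr == {"vertical": 0.5}
```

Reporting per-category rates rather than a single pooled number is what lets a benchmark of this kind localize which spatial skills an agent lacks.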